{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 提纲\n",
    "\n",
    "本次实验课的主要内容有\n",
    "- 简要介绍nltk\n",
    "- 简要介绍gensim\n",
    "- 通过以下三种方法实现一个简单的情感分类器\n",
    "    - ngram特征+朴素贝叶斯\n",
    "    - 情感词典\n",
    "    - 词向量+SVM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 NLTK\n",
    "\n",
    "NLTK的全称是Natural Language Toolkit，是一个非常经典的自然语言处理工具。\n",
    "\n",
    "官网：http://www.nltk.org/\n",
    "\n",
    "NLTK可以处理自然语言处理中的绝大多数基础任务，并包含50多种常用语料库。\n",
    "\n",
    "- tokenize\n",
    "- 词性标注\n",
    "- 词形还原\n",
    "- 词干提取\n",
    "- 句法解析\n",
    "- 停用词\n",
    "- wordnet\n",
    "- ...\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenize\n",
    "\n",
    "首先来看NLP处理中最基本的步骤，做分词。我们对下面这句话最分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "\"Two plus two is four, minus one that's three — quick maths. Every day man's on the block. Smoke trees. See your girl in the park, that girl is an uckers. When the thing went quack quack quack, your men were ducking! Hold tight Asznee, my brother. He's got a pumpy. Hold tight my man, my guy. He's got a frisbee. I trap, trap, trap on the phone. Moving that cornflakes, rice crispies. Hold tight my girl Whitney.\""
      ]
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "my_string = \"Two plus two is four, minus one that's three — quick maths. Every day man's on the block. Smoke trees. See your girl in the park, that girl is an uckers. When the thing went quack quack quack, your men were ducking! Hold tight Asznee, my brother. He's got a pumpy. Hold tight my man, my guy. He's got a frisbee. I trap, trap, trap on the phone. Moving that cornflakes, rice crispies. Hold tight my girl Whitney.\"\n",
    "my_string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**使用python最基础的分词方法来对下面这句话做分词**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Two', 'plus', 'two', 'is', 'four,', 'minus', 'one', \"that's\", 'three', '—', 'quick', 'maths.', 'Every', 'day', \"man's\", 'on', 'the', 'block.', 'Smoke', 'trees.', 'See', 'your', 'girl', 'in', 'the', 'park,', 'that', 'girl', 'is', 'an', 'uckers.', 'When', 'the', 'thing', 'went', 'quack', 'quack', 'quack,', 'your', 'men', 'were', 'ducking!', 'Hold', 'tight', 'Asznee,', 'my', 'brother.', \"He's\", 'got', 'a', 'pumpy.', 'Hold', 'tight', 'my', 'man,', 'my', 'guy.', \"He's\", 'got', 'a', 'frisbee.', 'I', 'trap,', 'trap,', 'trap', 'on', 'the', 'phone.', 'Moving', 'that', 'cornflakes,', 'rice', 'crispies.', 'Hold', 'tight', 'my', 'girl', 'Whitney.']\n"
     ]
    }
   ],
   "source": [
    "tokenized_by_space = my_string.split()\n",
    "print(tokenized_by_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**使用nltk的分词器来对下面这句话做分词**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['Two', 'plus', 'two', 'is', 'four', ',', 'minus', 'one', 'that', \"'s\", 'three', '—', 'quick', 'maths', '.', 'Every', 'day', 'man', \"'s\", 'on', 'the', 'block', '.', 'Smoke', 'trees', '.', 'See', 'your', 'girl', 'in', 'the', 'park', ',', 'that', 'girl', 'is', 'an', 'uckers', '.', 'When', 'the', 'thing', 'went', 'quack', 'quack', 'quack', ',', 'your', 'men', 'were', 'ducking', '!', 'Hold', 'tight', 'Asznee', ',', 'my', 'brother', '.', 'He', \"'s\", 'got', 'a', 'pumpy', '.', 'Hold', 'tight', 'my', 'man', ',', 'my', 'guy', '.', 'He', \"'s\", 'got', 'a', 'frisbee', '.', 'I', 'trap', ',', 'trap', ',', 'trap', 'on', 'the', 'phone', '.', 'Moving', 'that', 'cornflakes', ',', 'rice', 'crispies', '.', 'Hold', 'tight', 'my', 'girl', 'Whitney', '.']\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "tokenized_by_nltk = word_tokenize(my_string)\n",
    "\n",
    "print(tokenized_by_nltk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**同理，可以用nltk实现分句**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[\"Two plus two is four, minus one that's three — quick maths.\",\n",
       " \"Every day man's on the block.\",\n",
       " 'Smoke trees.',\n",
       " 'See your girl in the park, that girl is an uckers.',\n",
       " 'When the thing went quack quack quack, your men were ducking!',\n",
       " 'Hold tight Asznee, my brother.',\n",
       " \"He's got a pumpy.\",\n",
       " 'Hold tight my man, my guy.',\n",
       " \"He's got a frisbee.\",\n",
       " 'I trap, trap, trap on the phone.',\n",
       " 'Moving that cornflakes, rice crispies.',\n",
       " 'Hold tight my girl Whitney.']"
      ]
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "source": [
    "from nltk.tokenize import sent_tokenize\n",
    "\n",
    "sentences = sent_tokenize(my_string)\n",
    "sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**做一个词频统计**\n",
    "\n",
    "词频（term frequency，TF）指的是某一个给定的词语在语料库中出现的次数。\n",
    "词频信息是反映语料库特征一个的基本统计量。\n",
    "- 政府工作报告中的高频词\n",
    "- 使用词频推断红楼梦后40回是否为续写\n",
    "- 通过词频的变化来观察社会的变迁\n",
    "- 英语考试中重点词汇\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAEaCAYAAAAWvzywAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAmMUlEQVR4nO3deZxU1Z338c+vu6Gh2XcaFBAVFFnUblRENC6JGxg10THRicYFJ5tmfcxkMpk4i5nMk0fHMYu7iSYhRmMGG3E3yqKI3cqOiqIIsstON/T2e/6o29g0DTTVVfdUV33fr1e9qureqjpfsfvXp8699xxzd0REJHfkhQ4gIiLxUuEXEckxKvwiIjlGhV9EJMeo8IuI5JiC0AFaonfv3j5kyJCk3ltVVUXHjh1TG0g5siJHJmRQDuVIZ46KioqN7t5nnx3unvG3kpIST1Z5eXnS700l5dhbJuTIhAzuytGUcuytNTmAcm+mpmqoR0Qkx6jwi4jkGBV+EZEco8IvIpJjVPhFRHJM2gq/mT1oZuvNbFGjbZeZ2WIzqzez0nS1LSIi+5fOHv9vgfOabFsEXArMSGO7e9lZUx9XUyIibULaCr+7zwA2Ndm21N3fSVebje2qqePGR8r5+vQN7NhdG0eTIiJtgnka5+M3syHANHcf2WT7y8D33b38AO+dDEwGKC4uLikrKzvk9n/8t09YurGGa8Z0YdKwTof8/lSqrKykqKgoaAblyLwMyqEc6cxRWlpa4e77Dqs3d1VXqm7AEGBRM9tfBkpb+jnJXrn73OK1PviWaT7uthe8urYuqc9IlWy4CjCVMiFHJmRwV46mlGNvunL3EJ19TF8Gdsln9dZdPLVgTeg4IiIZIasLf16ecVE0xHPPjOUN3zZERHJaOk/nnAK8Bgw3s1Vmdp2ZXWJmq4BxwFNm9my62m9w+uCO9O5cyNI125j93ifpbk5EJOOl86yeL7l7sbu3c/fD3P0Bd/9r9LjQ3fu5+7npar9B+3zjmlMHA3DPjPfT3ZyISMbL6qGeBledMpii9vnMXLaRJau3hY4jIhJUThT+7kXtubz0cADum7k8cBoRkbByovADXHfaEeQZlM1fzeotVaHjiIgEkzOF//CeRVwwqpjaeueh2R+EjiMiEkzOFH6AG08/EoApc1eybVdN4DQiImHkVOEfdVg3xg3txY7dtfzx9Y9CxxERCSKnCj/A5DOGAvDQ7A+ortXMnSKSe3Ku8H9mWB+G9+vCum27eXL+6tBxRERil3OF38y4fsIRANynaRxEJAflXOEH+PzxA+nXtZB31m3n5Xc3hI4jIhKrnCz87Qvy+Or4RK//3ld0QZeI5JacLPwAXz55EJ0LC3ht+ScsXLU1dBwRkdjkbOHv2qEdV4xNTONwr6ZxEJEckrOFH+Da046gIM+YvnANKzdVho4jIhKLnC78A7p3ZNKYAdTVOw/M0jQOIpIbcrrwA9wwIXFB16NvrGRLZXXgNCIi6ZfOFbgeNLP1Zrao0baeZva8mS2L7nukq/2WGjGgKxOO7k1VTR1/0DQOIpID0tnj/y1wXpNtPwRedPejgRej58FNPr1hGocP2VVTFziNiEh6pXPpxRnApiabPw/8Lnr8O+DidLV/KE47qjcjiruyccdu/vetj0PHERFJK0vnlAVmNgSY5u4jo+db3L17o/2b3b3Z4R4zmwxMBiguLi4pKytLKkNlZSVFRUUHfd2MFVXcOXcrA7rkc+e5vckzS6q91uZIN+XIrAzKoRzpzFFaWlrh7qX77HD3tN2AIcCiRs+3NNm/uSWfU1JS4skqLy9v0euqa+t83G0v+OBbpvnzi9cm3V5rc6SbcmRWBnflaEo59taaHEC5N1NT4z6rZ52ZFQNE9+tjbn+/2uXnce1p0TQOM3RBl4hkr7gL/5PA1dHjq4GpMbd/QFecNIguHQqY++Em3vpoc+g4IiJpkc7TOacArwHDzWyVmV0H/CfwWTNbBnw2ep4xOhcWcOXJgwH1+kUkexWk64Pd/Uv72XV2utpMha+OH8IDs5bzzOK1fLhxJ0N6dwodSUQkpXL+yt2m+nXtwMXHD8QdTeMgIllJhb8ZN0QXdD1WsZJNOzWNg4hkFxX+Zgzr14Uzh/dhV009D7/2Yeg4IiIppcK/H5NPPxKAh19bQVW1pnEQkeyhwr8fpwztyejDurFpZzWPv7kqdBwRkZRR4d8PM9szedsDM5dTV5++qS1EROKkwn8A5x3Xn8N6dOTDTyp5fsna0HFERFJChf8ACvLzuD6axuGeGcsb5hcSEWnTVPgP4vKxh9O9qB1vfbSF8hWaxkFE2j4V/oMoal/A35+iaRxEJHuo8LfAV8YNoX1BHi8sXcf7G3aEjiMi0ioq/C3Qp0shXzgxMY3D/TPV6xeRtk2Fv4WunzAUM/jLmx+zYfvu0HFERJKmwt9CR/bpzDnH9qO6tp7fvfph6DgiIklT4T8EN0YXdD0yZwWV1bWB04iIJEeF/xCUDO7BCYO6s7Wqhj+/sTJ0HBGRpAQp/GZ2s5ktMrPFZvbtEBmSYWZ7ev33z/qA2rr6wIlERA5d7IXfzEYCNwAnAWOAiWZ2dNw5kvXZEf0Z0quIVZureHqRpnEQkbYnRI//WGCOu1e6ey3wCnBJgBxJyc8zrp+Q6PXfq2kcRKQNsrgLl5kdC0wFxgFVwItAubt/q8nrJgOTAYqLi0vKysqSaq+yspKioqJWZW5qd53zD9PWs63aufWMHozsWxgkRzKUI7MyKIdypDNHaWlphbuX7rPD3WO/AdcBbwIzgLuBOw70+pKSEk9WeXl50u89kDuef8cH3zLNr3nw9aA5DpVyZFYGd+VoSjn21pocJDrV+9TUIAd33f0Bdz/R3U8HNgHLQuRoja+MG0KHdnn87Z0NvLtue+g4IiItFuqsnr7R/SDgUmBKiByt0bNTey4rORzQ5G0i0raEOo//L2a2BCgDvuHubXK+4+snHIEZTJ33Meu27QodR0SkRUIN9Uxw9xHuPsbdXwyRIRUG9+rEecf1p6bOeWj2h6HjiIi0iK7cbaWGdXn/8PoKduzWNA4ikvlU+FvphEE9OGlIT7bvquVPcz8KHUdE5KBU+FOgodf/4KwPqNE0DiKS4VT4U+CsY/pyZJ9OrN66i6cWrAkdR0TkgFT4UyAvz7ghmsbhHk3jICIZToU/RS4+YSC9OxeydM02Zr23MXQcEZH9UuFPkQ7t8vnq+CGALugSkcymwp9CV508mKL2+cxctpElq7eFjiMi0iwV/hTqVtSOvxubmMbhvpnq9YtIZlLhT7Frxx9Bfp5RNn81q7dUhY4jIrIPFf4UO7xnEReMKqa23nlw1geh44iI7EOFPw0a1uWdMvcjtlbVBE4jIrI3Ff40GDmwG6ce2Yud1XVM0TQOIpJhVPjT5Iao1//Q7A+ortU0DiKSOVT40+Qzw/owvF8X1m3bzdR5H4eOIyKyR6gVuL5jZovNbJGZTTGzDiFypJOZ7en13zdT0ziISOaIvfCb2UDgJqDU3UcC+cAVceeIw0VjBtC/awfeXbeDt9ZWh44jIgKEG+opADqaWQFQBKwOlCOt2hfk7ZnGYeo7O8OGERGJWIghCDO7GfgPoAp4zt2vbOY1k4HJAMXFxSVlZWVJtVVZWUlRUVEr0rbOzpp6JpdtYFed86vze9O/c0GwLBD+3yOTcmRCBuVQjnTmKC0trXD30n12uHusN6AH8BLQB2gH/C9w1YHeU1JS4skqLy9P+r2pctOUN33wLdP8ly8tCx0lI/493DMjRyZkcFeOppRjb63JAZR7MzU1xFDPOcAH7r7B3WuAJ4BTA+SIzcTRAwAom5+VI1oi0saEKPwfAaeYWZGZGXA2sDRAjticPqw3Re2Mt9du573120PHEZEcF3vhd/fXgceBN4GFUYZ7484Rp8KCfE4emDhjtWy+lmYUkbCCnNXj7v/i7se4+0h3/3t33x0iR5zGH54o/NMWrNY5/SISlK7cjcmovu3p2ak972/YydI1Gu4RkXBU+GNSkGecN7I/kOj1i4iEosIfo0kNZ/douEdEAlLhj9FJR/SkT5dCVm6qYv6qraHjiEiOOuTCb2Y9zGx0OsJku/w848JRxQBM0zn9IhJIiwq/mb1sZl3NrCcwH3jIzG5Pb7TsNGlMYrhn2oI11NdruEdE4tfSHn83d98GXAo85O4lJK7AlUN04qDuDOzekbXbdlHx0ebQcUQkB7W08BeYWTFwOTAtjXmynpkxcXRiuEdTOIhICC0t/LcCzwLvufsbZjYUWJa+WNmtYe6e6QvXUFunZRlFJF4tLfxr3H20u38dwN2XAxrjT9LIgV0Z0quIjTuqef2DTaHjiEiOaWnhv6uF26QFzGzPQV4N94hI3A5Y+M1snJl9D+hjZt9tdPspiSUTJUkNwz3PLF5Lda2Ge0QkPgfr8bcHOpNYKrFLo9s24IvpjZbdhvfvwrB+ndlSWcPs9zaGjiMiOeSA6wC6+yvAK2b2W3dfEVOmnDFx9ABuf/5dyuav5sxj+oaOIyI5oqVj/IVmdq+ZPWdmLzXc0posBzSc1vncknXsqqkLnEZEckVLV/5+DLgbuB9QhUqRoX06M3JgVxZ9vI2X39mwZ/ZOEZF0ammPv9bdf+Puc929ouGWTINmNtzM5jW6bTOzbyfzWdmg4SCvpmoWkbi0tPCXmdnXzazYzHo23JJp0N3fcffj3f14oASoBP6azGdlg4ZJ215cup7K6trAaUQkF7S08F8N/AB4FaiIbuUpaP9s4P1cPnB8eM8iThjUnaqaOl5Yuj50HBHJARZyQRAzexB4091/2cy+ycBkgOLi4pKysrKk2qisrKSoqKhVOVPhQDmmLdvJQ/O2c9KAQm4Z3yNYjjhlQo5MyKAcypHOHKWlpRXuXrrPDnc/6A34SnO3lrz3AJ/ZHtgI9DvYa0tKSjxZ5eXlSb83lQ6UY+3WKh/yw2l+9I+m+9aq6mA54pQJOTIhg7tyNKUce2tNDqDcm6mpLR3qGdvoNgH4KXBRUn+CPnU+id7+ulZ+TpvXr2sHThrSk+q6ep5fnPP/HCKSZi06ndPdv9X4uZl1Ax5pZdtfAqa08jOyxqQxA3j9g02ULVjNF0oOCx1HRLJYsmvuVgJHJ9uomRUBnwWeSPYzss35I/uTn2fMWraRzTurQ8cRkSzW0qUXy8zsyej2FPAOMDXZRt290t17ubtWHI/06lzIqUf2orbeeWbx2tBxRCSLtfTK3V80elwLrHD3VWnIk9MmjRnAzGUbKZu/mi+dNCh0HBHJUi3q8Xtisra3SczM2QPQWEQanDuiP+3yjTnLP2H99l2h44hIlmrpUM/lwFzgMhLr7r5uZpqWOcW6FbXjjGF9qHd4eqGGe0QkPVp6cPefgLHufrW7fwU4Cfjn9MXKXZq7R0TSraWFP8/dG88n8MkhvFcOwTkj+lFYkMcbH25m9Zaq0HFEJAu1tHg/Y2bPmtk1ZnYN8BQwPX2xclfnwgLOPjaxKMtTC9YETiMi2ehga+4eZWbj3f0HwD3AaGAM8Bpwbwz5cpKGe0QknQ7W4/9vYDuAuz/h7t919++Q6O3/d3qj5a4zh/elU/t85q/ayopPdoaOIyJZ5mCFf4i7L2i60d3LgSFpSSR0bJ/POSP6ATBNwz0ikmIHK/wdDrCvYyqDyN4mRcM9ZfM13CMiqXWwwv+Gmd3QdKOZXUdiMRZJkwnDetO1QwFvr93Oe+u3h44jIlnkYIX/28BXzexlM/t/0e0V4Hrg5rSny2GFBfmce1xi8fWy+RruEZHUOWDhd/d17n4qcCvwYXS71d3HubsuLU2zSWOi4Z4FqxsWrxERabWWzsf/N+Bvac4iTZx6ZC96dmrP8g07WbpmOyMGdA0dSUSygK6+zWAF+XmcPzIa7tE5/SKSIir8GW7PcM98DfeISGoEKfxm1t3MHjezt81sqZmNC5GjLRg7pCd9uxSyanMV81dp3RoRab1QPf47gWfc/RgSU0AsDZQj4+XnGReOLgZ0Tr+IpEbshd/MugKnAw8AuHu1u2+JO0db0jB3z1ML1lBfr+EeEWkdi3vc2MyOJzHB2xISvf0K4GZ339nkdZOByQDFxcUlZWVlSbVXWVlJUVFRayKnRGtyuDtfm76BDZX1/NtnejKiT/sgOVIpE3JkQgblUI505igtLa1w99J9drh7rDeglMS6vSdHz+8E/u1A7ykpKfFklZeXJ/3eVGptjtumL/HBt0zzH/91YdAcqZIJOTIhg7tyNKUce2tNDqDcm6mpIcb4VwGr3P316PnjwIkBcrQpDXP3PL1oDbV19YHTiEhbFnvh98QVvyvNbHi06WwSwz5yAMcN6MoRvTuxcUc1c5ZvCh1HRNqwUGf1fAv4g5ktAI4HbguUo80wMyZGZ/dogRYRaY0ghd/d57l7qbuPdveL3X1ziBxtTcPFXE8vWkt1rYZ7RCQ5unK3DRnWrwvD+3Vha1UNs9/bGDqOiLRRKvxtzERdzCUiraTC38ZMjIZ7nluyjl01dYHTiEhbpMLfxhzRuxMjB3Zlx+5aXn5nQ+g4ItIGqfC3QXvW49XZPSKSBBX+Nqhh0raXlq6nsro2cBoRaWtU+Nugw3oUceKg7lTV1PHC0vWh44hIG6PC30Y1XqBFRORQqPC3UReMKsYMXnlnA9t21YSOIyJtiAp/G9WvawdOPqIn1XX1PLd4Xeg4ItKGqPC3YQ3DPZq7R0QOhQp/G3b+yGLy84xZyzayaWd16Dgi0kao8LdhPTu1Z/xRvamtd55ZtDZ0HBFpI1T42zhN1Swih0qFv40797j+tMs35iz/hPXbd4WOIyJtQJDCb2YfmtlCM5tnZuUhMmSLbh3bccawvtQ7PL1Qwz0icnAhe/xnuvvx3twK8HJIJo3RVM0i0nIa6skC5xzbjw7t8ihfsZnVW6pCxxGRDGfuHn+jZh8AmwEH7nH3e5t5zWRgMkBxcXFJWVlZUm1VVlZSVFTUirSpke4cv3htM6+t2s3Vo7tw0fBOwXK0VCbkyIQMyqEc6cxRWlpa0eyoirvHfgMGRPd9gfnA6Qd6fUlJiServLw86femUrpzTF+w2gffMs0n3TUzaI6WyoQcmZDBXTmaUo69tSYHUO7N1NRQi62vju7XA38FTgqRI5uceUxfOrXPZ8Gqraz4ZGfoOCKSwWIv/GbWycy6NDwGPgcsijtHtunQLp/PjugHwLQFawKnEZFMFqLH3w+YZWbzgbnAU+7+TIAcWUdTNYtISxTE3aC7LwfGxN1uLphwdB+6dijg7bXbWbZuO0f36xI6kohkIJ3OmUXaF+Rx3sj+AJRpuEdE9kOFP8s0nqrZA5yqKyKZT4U/y4wb2otendqzfMNOlqzZFjqOiGQgFf4sU5Cfx/mjEsM9OrtHRJqjwp+FJo3+9OweDfeISFMq/Flo7JCe9OtayKrNVcxbuSV0HBHJMCr8WSgvz7hwVMNBXg33iMjeVPiz1MRoquanFqyhvl7DPSLyKRX+LHXC4d0Z2L0ja7ftonzF5tBxRCSDqPBnKTPTFA4i0iwV/izWsBD79IVrqK2rD5xGRDKFCn8WO25AV4b27sQnO6uZs3xT6DgikiFU+LOYme3p9Wu4R0QaqPBnuYZx/mcWr6W6VsM9IqLCn/WO7teFY/p3YWtVDbPe2xA6johkABX+HPDpcI8u5hKRgIXfzPLN7C0zmxYqQ66YGM3d8/ySdeyu08VcIrkuZI//ZmBpwPZzxpDenRg1sBs7dtfy1prdoeOISGCxL70IYGaHARcC/wF8N0SGXDNpTDELP97KnXO38MCC50PHobamhoLpYXMUeB3fq/+Iy0sPx8yCZhGJk4WYttfMHgd+BnQBvu/uE5t5zWRgMkBxcXFJWVlZUm1VVlZSVFTUirSpETrHpqo6vvPsRnbUaKinqfGHd+DGkq50ahfmC3Donw3lyN4cpaWlFe5e2nR77D1+M5sIrHf3CjP7zP5e5+73AvcClJaWeklJSVLtVVRUkOx7UykTclScUsfsuW8yZvTooDkA5i9YEDzHQ8++wYPzdzB75S4+2mn8zxUncMKgHrHnyISfDeXIrRwhhnrGAxeZ2QVAB6Crmf3e3a8KkCWnFBbk060wj16dC0NHyYgcZw7pyKWnH8+3przF4tXbuOzu1/je54Zz4+lDycvT0I9kr9i/27r7P7r7Ye4+BLgCeElFX0IZ2qczT3z9VK4dfwS19c7Pn3mbqx+ay/rtu0JHE0kbnccvOa+wIJ+fTBrBg9eU0rNTe2Yu28gFd87klXd1wZtkp6CF391fbu7ArkgIZx3Tj6dvnsC4ob3YuKOaqx+cy8+mL9VUF5J11OMXaaRf1w78/vqT+f7nhpGfZ9wzYzmX3f0qH31SGTqaSMqo8Is0kZ9nfPOso/nzjacwsHtH5q/aygX/M5Op8z4OHU0kJVT4RfajZHBPpt80gfNH9mfH7lpu/tM8fvDYfCqra0NHE2kVFX6RA+hW1I5fX3ki/3HJSAoL8nisYhUT75rF4tVbQ0cTSZoKv8hBmBlXnjyYJ795GsP6dWb5hp1c8qtX+e3sDwhx5btIa6nwi7TQ8P5dmPqN0/jyyYOorqvnp2VLuOHhCjbvrA4dTeSQqPCLHIKO7fO57ZJR/PrKE+nSoYAXlq7j/DtnMmf5J6GjibSYCr9IEi4YVczTN0+gZHAP1m7bxZfvm8Mdz79LbZ3O+ZfMp8IvkqTDehTx6ORT+OaZR+HAnS8u48v3vc7qLVWho4kckAq/SCsU5Ofx/XOH84frTqZvl0LmfriJ8++cyXOL14aOJrJfKvwiKXDqUb15+uYJnDm8D1urapj8SAU/mbqIXTV1oaOJ7EOFXyRFenUu5MFrxvLjC4+lXb7x8GsruPhXs3lv/fbQ0UT2osIvkkJmxvUThvLE18YzpFcRb6/dzqS7ZvPoGx/pnH/JGCr8Imkw6rBuTLtpApeeMJCqmjpu+ctCvjXlLbbtqgkdTUSFXyRdOhcWcPvfHc/tl4+hU/t8pi1Yw4X/M5O3PtocOprkOBV+kTS79MTDmHbTBEYO7MrKTVVcdvdr/Obl96mv19CPhBF74TezDmY218zmm9liM7s17gwicTuidyf+8rVTue40LfEo4YVYbH03cJa77zCzdsAsM3va3ecEyCISm8KCfP554ghOO6o333ts/p4lHr8ysiMFfbeEjsd7m2ooWKkcmZZjY2XqTwmOvfB74tSGHdHTdtFN33klZ5x5TF+evnkC33l0Hq++/wm3z6nm9jmzQ8dKeFE59pIBOS4e3olzJ6T2My3EKWZmlg9UAEcBv3L3W5p5zWRgMkBxcXFJWVlZUm1VVlZSVFTUirSpoRyZlyN0hjp3pr1byayPdmIW/nBbfX09eXnKkWk5xg8o4PMjuif13tLS0gp3L91nh7sHuwHdgb8BIw/0upKSEk9WeXl50u9NJeXYWybkyIQM7srRlHLsrTU5gHJvpqYG/XPm7luAl4HzQuYQEcklIc7q6WNm3aPHHYFzgLfjziEikqtCnNVTDPwuGufPA/7s7tMC5BARyUkhzupZAJwQd7siIpIQ/pC1iIjESoVfRCTHqPCLiOQYFX4RkRwT5MrdQ2VmG4AVSb69N7AxhXGSpRx7y4QcmZABlKMp5dhba3IMdvc+TTe2icLfGmZW7s1dsqwcOZ8jEzIoh3KEyKGhHhGRHKPCLyKSY3Kh8N8bOkBEOfaWCTkyIQMoR1PKsbeU58j6MX4REdlbLvT4RUSkERV+EZEco8IvIpJjVPhjZGbFZlYYOkcuM7OeGZBhn58B/VxIAzM7oiXbWtVGrhzcNbP+7r42cIYXgCOBv7j792Nsdzwwz913mtlVwInAne6e7NXQyWToB9wGDHD3881sBDDO3R+IK0OUYxkwD3gIeNoD/AKY2ZvufuLBtqU5w6UH2u/uT8SY5V/d/SeNnucDD7v7lTFmWAjs92fB3UfHmKW5n48Kdy9JVRshFmIJ5QHgwpAB3P0cMzNgRMxN/wYYY2ZjgP9D4t/iYeCMGDP8lkSx/afo+bvAo1GWOA0jserbtcBdZvYo8Ft3fzfdDZtZf2Ag0NHMTgAs2tUViHvV90kH2OdAbIUfGGRm/+juP4u++TwGvBlj+wATo/tvRPePRPdXApVxBDCzY4DjgG5N/jB3BTqktK1c6fHnsoYehJn9BPjY3R8I0MN8w93Hmtlb7n5CtG2eux8fV4ZmMp0J/B7oBMwHfujur6WxvauBa4BSoLzRru0k/vjEWWwxszzgi+7+5zjbbSaHAX8AFgJnkvgmdkegLLPdffzBtqWp7c8DFwMXAU822rUd+JO7v5qqtnKpx5/LtpvZPwJXAadHX6XbxZxhp5n1Ivo6bWanAFtjzkCU4SrgK8Ba4FskfsmOJ9HTTOlYamPu/jsSy45+wd3/kq52DiFPvZl9EwhS+M2sccfjTuAeYDbwipmd6O5x9/oBOpnZae4+K8p4KomOQdq5+1RgqpmNS2cHBNTjzwnREMOXgTfcfaaZDQI+4+4Px5jhROAuYCSwCOhDore5IK4MUY53SXyNf9DdP26y7xZ3/3lMOS4k8bV+z1d4d//XONpukuOfgSoSw247G2XZFEPbfzvAbnf3s9KdoSkzKwEeBLpFm7YA18b5R8jM+gA3AENo1Dl392tT1oYKv8TFzAqA4STGtt9x95oAGcYCPwIGs/cvVZwH7+4mMaZ/JnA/8EVgrrtfF1eGRlk+aGazu/vQuLNkEjPrSqI+hvhW+iowE6gA6hq2p/Jbogp/Fot+qR3Y4O4nZ0CeU9m3FxPbt44owzvA90l866hvlCPOM5wWuPvoRvedgSfc/XNxZcgkZnYb8F/uviV63gP4nrv/OMYM3z3Qfne/PcYsaT/2pTH+LObuaRuvPlRm9giJU1nn8WkvxkmcXRSnDe5eFnObTVVF95VmNgD4hDQeW2iOmZ3l7i/t77TOmA80n+/uP2rU9mYzuwCIrfADXWJs62CmmdkF7j49XQ2o8EtcSoERIc6bb+JfzOx+4EVgd8PGmAvdNDPrDvxfEqctOokhnzidAbxE4rROJzH81vg+zn+PfDMrdPfdAGbWEYj1gjZ3vzXO9ppjZtv59P/Bj8xsN1ATPXd375qytsL/HkouMLPHgJvcfU3gHL8HjgEW8+lQj6fywNkh5ikEOoQYS47a/x6fFhuix1uBCnefF1OG/0PiFMaHovavBZ509/+Ko/0mWQ4jcRLC+CjLLOBmd18Vd5Z0UuGXtDKzMhK/QF1InDI5l7172hfFnGehu4+Ks8395Ah+vCPK8UcS38aeJFH8LwTeIPHH8bG4iq+ZnQ+cHWV4zt2fjaPdZnI8D/yRTy/gugq40t0/G2OG5q6v2QqscPfalLShwi/pZGZnkPhl/jmJq4b37AJ+HvdBZzO7D7jD3ZfE2W6TDM0e73D3mwJkeRb4grvviJ53Bh4HLiHR64/7KvOgmjuwGveFhmY2h8S0KgujTaNIXGDYC/gHd3+utW1ojF/Syt1fATCzdg2PG0RjuXE7Dbg6OuNpN5+On8Z2OieZc7wDYBBQ3eh5DTDY3auiMea0iy7muws4FmgP5AM7UzmmfQg2RvNZTYmef4nEwfc4fQhc5+6LAaJ5rX4A/BuJYy8q/JLZzOxrwNeBoWbW+GKtLiSu0ozbeQHabGoR0B8Ierwj8kdgjplNjZ5PAqaYWScgrm9FvwSuIHHldCmJq6qPiqntpq6N8txBYojy1WhbnI5pKPoA7r7EzE5w9+WJ2S1aT0M9klZm1g3oAfwM+GGjXdvjuDo0k2Ta8Y5GuUpIfBMyYJa7lx/kLaluv9zdSxuua4i2verup8aZI1NEEwduAv4Ubfo7oDfw9yT+/4xtdRsq/CLxyLTjHZnCzGaQmDH1fhLzJ60BrnH3MTFmuIsDT8sc2/GXaAj06zT6Ywz8GtgFFDUcj2lVGyr8IvHaz3zrC2I+zpAxzGwwsJ7ExIHfITFPzq/d/b0YM1zd6OmtwL803h9NsJc1VPhFYtL4eAfwfqNdXYDZ7n5VkGCyl8ZTh8fc7p/d/XLbz6IwqewY6OCuSHz+CDyNjnfspdGcUnsJOFFcqN7wzdH9UhJn8TQwIKXXU6jwi8Qkujp3K4lTBOVTpY0edwAuA4KvjRy3Rle1H9V00sBoda6U0VCPiGQcM5vl7qfF2F7DPDmQmDK7YbnFlM+Tc4AMsQ0FqscvIkE1maIgj8Q3gFhny3T3TJidM7ahQPX4RSSoaCWuhkJUS+LK1V+4+7vBQmU5FX4RCWo/M4TuEeciKLlCQz0iEloJMBaYSqL4TwJmACtDhspm6vGLSFBm9hyJGUK3R8+7kJgSOhPmVcpKeaEDiEjOazpDaDWJtQokTTTUIyKhPQLMNbO/khjfvwTIqikSMo2GekQkuOiUzgnR0xnu/lbIPNlOhV9EJMdojF9EJMeo8IuI5BgVfsk5ZvZPZrbYzBaY2TwzS9sCKGb2spmVHvyVIvHRWT2SU8xsHDARONHdd5tZbxILfIvkDPX4JdcUAxvdfTeAu29099Vm9hMze8PMFpnZvRatah312O8wsxlmttTMxprZE2a2zMz+PXrNEDN728x+F32LeNzMipo2bGafM7PXzOxNM3vMzDpH2//TzJZE7/1FjP8WkqNU+CXXPAccbmbvmtmvo3VwAX7p7mPdfSTQkcS3ggbV7n46cDeJaQW+AYwErjGzXtFrhgP3RqskbSMxve4e0TeLHwPnRMsulgPfNbOeJM5bPy5677+n4b9ZZC8q/JJTooWqS4DJwAbgUTO7BjjTzF6Plr07Cziu0duejO4XAovdfU30jWE5cHi0b6W7z44e/57EQtmNnQKMAGab2TzgamAwiT8Su4D7zexSPp0HXiRtNMYvOcfd64CXgZejQn8jMBoodfeVZvZTEitBNdgd3dc3etzwvOF3qOkFMU2fG/C8u++z+paZnQScDVwBfJPEHx6RtFGPX3KKmQ03s6MbbToeeCd6vDEad/9iEh89KDpwDImlFWc12T8HGG9mR0U5isxsWNReN3efDnw7yiOSVurxS67pDNxlZt1JLPrxHolhny0khnI+BN5I4nOXAleb2T3AMuA3jXe6+4ZoSGmKmRVGm38MbAemmlkHEt8KvpNE2yKHRFM2iLSSmQ0BpkUHhkUynoZ6RERyjHr8IiI5Rj1+EZEco8IvIpJjVPhFRHKMCr+ISI5R4RcRyTH/HyTu6JBLEzHCAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:xlabel='Samples', ylabel='Counts'>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk import FreqDist\n",
    "\n",
    "fdist = FreqDist(tokenized_by_nltk)\n",
    "fdist.plot(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 词性标注, 词形还原，词干提取\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[('Two', 'CD'),\n",
       " ('plus', 'CC'),\n",
       " ('two', 'CD'),\n",
       " ('is', 'VBZ'),\n",
       " ('four', 'CD'),\n",
       " (',', ','),\n",
       " ('minus', 'CC'),\n",
       " ('one', 'CD'),\n",
       " ('that', 'WDT'),\n",
       " (\"'s\", 'VBZ'),\n",
       " ('three', 'CD'),\n",
       " ('—', 'NNP'),\n",
       " ('quick', 'JJ'),\n",
       " ('maths', 'NNS'),\n",
       " ('.', '.')]"
      ]
     },
     "metadata": {},
     "execution_count": 16
    }
   ],
   "source": [
    "from nltk import pos_tag\n",
    "\n",
    "# sentences是刚才使用句子级别分词器分好的句子\n",
    "first_sentence = sentences[0]\n",
    "# 使用词级别分词器进行分词\n",
    "first_sentence_tokenized = word_tokenize(first_sentence)\n",
    "# 进行词性标注\n",
    "pos_result = pos_tag(first_sentence_tokenized)\n",
    "\n",
    "pos_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "JJ: adjective or numeral, ordinal\n    third ill-mannered pre-war regrettable oiled calamitous first separable\n    ectoplasmic battery-powered participatory fourth still-to-be-named\n    multilingual multi-disciplinary ...\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "nltk.help.upenn_tagset('JJ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test\n",
      "test\n"
     ]
    }
   ],
   "source": [
    "# 词干提取 stemming\n",
    "\n",
    "# from nltk.stem import PorterStemmer, LancasterStemmer\n",
    "from nltk.stem.snowball import SnowballStemmer\n",
    "\n",
    "snowball = SnowballStemmer('english')\n",
    "\n",
    "print(snowball.stem('testing'))\n",
    "print(snowball.stem('tested'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "brightening\n",
      "box\n",
      "brighten\n"
     ]
    }
   ],
   "source": [
    "# 词形还原 lemmatizing\n",
    "\n",
    "from nltk import WordNetLemmatizer\n",
    "\n",
    "wnl = WordNetLemmatizer()\n",
    "\n",
    "print(wnl.lemmatize('brightening'))\n",
    "print(wnl.lemmatize('boxes'))\n",
    "print(wnl.lemmatize('brightening', pos='v'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 情感分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "train: 1600\ntest: 400\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import movie_reviews\n",
    "import random\n",
    "random.seed(42)\n",
    "\n",
    "\n",
    "def load_movie_reviews():\n",
    "    # 取出正负样本的标号\n",
    "    pos_ids = movie_reviews.fileids('pos')\n",
    "    neg_ids = movie_reviews.fileids('neg')\n",
    "    \n",
    "    # 构建数据集\n",
    "    all_reviews = []\n",
    "    for pids in pos_ids:\n",
    "        all_reviews.append((movie_reviews.raw(pids), 'positive'))\n",
    "    \n",
    "    for nids in neg_ids:\n",
    "        all_reviews.append((movie_reviews.raw(nids), 'negative'))\n",
    "        \n",
    "    # 随机打乱\n",
    "    random.shuffle(all_reviews)\n",
    "    \n",
    "    # 切分训练集和测试集\n",
    "    train_reviews = all_reviews[:1600]\n",
    "    test_reviews = all_reviews[1600:]\n",
    "\n",
    "    return train_reviews, test_reviews\n",
    "\n",
    "train_reviews, test_reviews = load_movie_reviews()\n",
    "print('train:', len(train_reviews))\n",
    "print('test:', len(test_reviews))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(\"saving private ryan ( dreamworks ) running time : 2 hours 48 minutes . \\nstarring tom hanks , edward burns , tom sizemore and matt damon directed by steven spielberg already being hailed as the 'greatest war movie ever made , ' saving private ryan is an harrowing , saddening and riveting movie . \\nit may not be the greatest war movie in my opinion , but it's certainly one of the best war movies made , and one of the best of 1998 . \\ntom hanks stars as a captain who's troop has to find private ryan ( damon ) who has a ticket home because his three brothers have been killed in action . \\naction , drama and even some humour occur as the troop journeys through wartime france to find him . \\nafter the disappointing amistad ( 1997 ) spielberg has returned to form with this excellent movie . \\ni'm not the war movie genre biggest fan , but i found this film to be gripping , and very scary , thanks to the excellent cast , direction and terrifying battle scenes . \\ntom hanks is superb , straying away from his usually soppy dramatic roles , such as in forrest gump ( 1994 ) . \\nthis time , he plays the role with gritty realism , and is much better for it . \\noccasionally he overacts the sentimentally , but he generally delivers a fine performance . \\nedward burns , looking a lot like armageddon's ben affleck , also delivers a top notch performance , moving away from his roles in films such as she's the one ( 1996 ) tom sizemore makes less of an impact , but is still watchable , and matt damon reinforcing his position as one of the finest young actors working today . \\nspielberg directs very well , putting the audience right in the heart of the action of the battle scenes . \\nand what battle scenes they are ! \\nthey're truly terrifying , yet the audience cannot drag their eyes away from the screen . \\nthe battle scenes are filmed with a jerky hand-held camera , and the panic and confusion felt by the soldiers is emphasized by this technique . \\nthe gore and violence isn't spared either , which body parts flying , and blood spurting . \\nthis film is certainly not for kids and sensitive adults . \\nother factors help saving private ryan be a masterpiece of 90's film making . \\nthe cinematography is excellent , and the music score by john william's is also superb . \\nit is never intrusive , and adds to the drama on-screen . \\nbut while they are thousands of good things great about private ryan , there's one major flaw that detracts the genius of the film : the writing . \\nit is unusually flat , with many of the speeches strangely weak . \\nthe film never really makes any profound statements . \\nthis is not a major gripe , as private ryan is a film of action , not words . \\nstill , the script could of been a lot better . \\nthankfully , the actors help partly to rectify the situation with their great delivery of their lines . \\nsaving private ryan , in the end , is an excellent film , but not the 'greatest war movie' due to it's weak acting . \\nthis film should be viewed by everyone who has the stomach for it , as it's rewarding and extremely worthwhile . \\nit really shouldn't be missed , and dreamworks skg has finally found it's first hit movie . \\n\",\n",
       " 'positive')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 查看一个训练样例\n",
    "train_reviews[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 尝试构建第一个特征抽取器 把每一个词都看作一个特征\n",
    "def extract_feature1(text):\n",
    "    feature = {}\n",
    "    text = text.lower()\n",
    "    for word in word_tokenize(text):\n",
    "        feature[f'contain: {word}'] = True\n",
    "    return feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract_feature1(train_reviews[2][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk import NaiveBayesClassifier\n",
    "import nltk\n",
    "\n",
    "def train_and_test(extract_feature, train_data, test_data):\n",
    "    training_set = nltk.classify.apply_features(extract_feature, train_data)\n",
    "    test_set = nltk.classify.apply_features(extract_feature, test_data)\n",
    "\n",
    "    classifier = NaiveBayesClassifier.train(training_set)\n",
    "    accuracy = nltk.classify.util.accuracy(classifier, test_set)\n",
    "    print(f'accuracy is {accuracy:.4f}')\n",
    "\n",
    "    return classifier\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy is 0.7450\n"
     ]
    }
   ],
   "source": [
    "model1 = train_and_test(extract_feature1, train_reviews, test_reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "          contain: sucks = True           negati : positi =     15.1 : 1.0\n",
      "       contain: nonsense = True           negati : positi =     13.1 : 1.0\n",
      "     contain: astounding = True           positi : negati =     11.6 : 1.0\n",
      "    contain: outstanding = True           positi : negati =     11.6 : 1.0\n",
      "      contain: stupidity = True           negati : positi =     11.1 : 1.0\n",
      "      contain: atrocious = True           negati : positi =     11.1 : 1.0\n",
      "         contain: seagal = True           negati : positi =     10.4 : 1.0\n",
      "         contain: avoids = True           positi : negati =     10.3 : 1.0\n",
      "         contain: finest = True           positi : negati =      9.8 : 1.0\n",
      "       contain: one-note = True           negati : positi =      9.7 : 1.0\n"
     ]
    }
   ],
   "source": [
    "model1.show_most_informative_features()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'positive'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentence = 'it is a wonderful movie'\n",
    "feature = extract_feature1(sentence)\n",
    "\n",
    "model1.classify(feature)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 只考虑形容词\n",
    "\n",
    "def extract_feature2(text):\n",
    "    text = text.lower()\n",
    "    feature = {}\n",
    "    tokens = word_tokenize(text)\n",
    "    for word, pos in pos_tag(tokens):\n",
    "        if pos == 'JJ':\n",
    "            feature[f'contain: {word}'] = True\n",
    "    return feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract_feature2(train_reviews[2][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "accuracy is 0.7250\n"
     ]
    }
   ],
   "source": [
    "model2 = train_and_test(extract_feature2, train_reviews, test_reviews)"
   ]
  },
  {
   "source": [
    "思考：该分类器性能不行？"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "    contain: outstanding = True           positi : negati =     11.6 : 1.0\n",
      "      contain: atrocious = True           negati : positi =     11.1 : 1.0\n",
      "       contain: one-note = True           negati : positi =      9.7 : 1.0\n",
      "      contain: ludicrous = True           negati : positi =      9.6 : 1.0\n",
      "          contain: fairy = True           positi : negati =      9.6 : 1.0\n",
      "     contain: unbearable = True           negati : positi =      9.0 : 1.0\n",
      "          contain: worst = True           negati : positi =      9.0 : 1.0\n",
      "         contain: truman = True           positi : negati =      9.0 : 1.0\n",
      "     contain: accessible = True           positi : negati =      8.3 : 1.0\n",
      "       contain: seamless = True           positi : negati =      8.3 : 1.0\n"
     ]
    }
   ],
   "source": [
    "model2.show_most_informative_features()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 利用bigram\n",
    "\n",
    "from nltk import ngrams\n",
    "\n",
    "\n",
    "def extract_feature3(text):\n",
    "    text = text.lower()\n",
    "    feature = {}\n",
    "    tokens = word_tokenize(text)\n",
    "    for word in tokens:\n",
    "        feature[f'contain: {word}'] = True\n",
    "    for bigram in ngrams(tokens, 2):\n",
    "        bigram = ' '.join(bigram)\n",
    "        feature[bigram] = True\n",
    "    return feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy is 0.7775\n"
     ]
    }
   ],
   "source": [
    "model3 = train_and_test(extract_feature3, train_reviews, test_reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "          contain: sucks = True           negati : positi =     15.1 : 1.0\n",
      "       contain: nonsense = True           negati : positi =     13.1 : 1.0\n",
      "                to waste = True           negati : positi =     13.1 : 1.0\n",
      "              matt damon = True           positi : negati =     12.3 : 1.0\n",
      "               insult to = True           negati : positi =     11.7 : 1.0\n",
      "            saving grace = True           negati : positi =     11.7 : 1.0\n",
      "               . cameron = True           positi : negati =     11.6 : 1.0\n",
      "     contain: astounding = True           positi : negati =     11.6 : 1.0\n",
      "              fairy tale = True           positi : negati =     11.6 : 1.0\n",
      "                 our own = True           positi : negati =     11.6 : 1.0\n",
      "    contain: outstanding = True           positi : negati =     11.6 : 1.0\n",
      "      contain: stupidity = True           negati : positi =     11.1 : 1.0\n",
      "                  so why = True           negati : positi =     11.1 : 1.0\n",
      "                waste of = True           negati : positi =     11.1 : 1.0\n",
      "                 & robin = True           negati : positi =     11.1 : 1.0\n",
      "                 awful . = True           negati : positi =     11.1 : 1.0\n",
      "                batman & = True           negati : positi =     11.1 : 1.0\n",
      "      contain: atrocious = True           negati : positi =     11.1 : 1.0\n",
      "           quite frankly = True           negati : positi =     11.1 : 1.0\n",
      "              and boring = True           negati : positi =     10.4 : 1.0\n"
     ]
    }
   ],
   "source": [
    "model3.show_most_informative_features(20)"
   ]
  },
  {
   "source": [
    "## 进一步的改进\n",
    "- 特征选择\n",
    "    - ngram会使得特征空间的大小快速增加\n",
    "    - 特征选择旨在初步剔除那些对分类无益的特征\n",
    "- 考虑词频信息，如TF-IDF\n",
    "    - 特征从一个词是否出现，变为这个词出现了多少次。特征的信息含量增加。\n",
    "- 处理否定词\n",
    "    - 将否定词+形容词，变为一个新的词\n",
    "    - 'not good' -> NOT_good\n",
    "- ..."
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.9.0 64-bit",
   "metadata": {
    "interpreter": {
     "hash": "12c989a2272087144a907f3af46e789b70a90abf7fd5b4372cac90cccd9eaa13"
    }
   }
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}